Near Data Scheduling for Data Centers with Multi Levels of Data Locality

Author

  • Ali Yekkehkhany
Abstract

Data locality is a fundamental issue for data-parallel applications. In Hadoop MapReduce, map task scheduling requires an efficient algorithm that takes data locality into account; otherwise, the system may become unstable under loads inside its capacity region, or jobs may experience undesirably long completion times. The data chunk needed by a map task can be in memory, on a local disk, in the local rack, in the same cluster, or even in another data center. Hence, although much work has gone into improving the speed of data center networks, a task still sees different service rates depending on where its data chunk is stored and which server serves it. Most theoretical work on load balancing addresses systems with two levels of data locality, notably the Pandas algorithm by Xie et al. and JSQ-MW by Wang et al.; the former is both throughput and heavy-traffic optimal, while the latter is throughput optimal but heavy-traffic optimal only in a special traffic scenario. We show that an extension of JSQ-MW to a system with three levels of data locality is throughput optimal, but again heavy-traffic optimal only in a special traffic scenario rather than for all loads. Furthermore, we show that Pandas is not even throughput optimal for a system with three levels of data locality. We then propose a novel algorithm, Balanced-Pandas, which is both throughput and heavy-traffic optimal. To the best of our knowledge, this is the first theoretical work on load balancing for a system with more than two levels of data locality, which, as we will see, is more challenging than the two-level case because a dilemma between performance and throughput emerges.

Similar resources

Near-Data Scheduling for Data Centers with Multiple

Data locality is a fundamental issue for data-parallel applications. Considering MapReduce in Hadoop, the map task scheduling part requires an efficient algorithm which takes data locality into consideration; otherwise, the system may become unstable under loads inside the system’s capacity region and jobs may experience longer completion times which are not of interest. The data chunk needed f...

Locality Information Based Scheduling in Shared Memory Multiprocessors

Lightweight threads have become a common abstraction in the field of programming languages and operating systems. This paper examines the performance implications of locality information usage in thread scheduling algorithms for scalable shared-memory multiprocessors. The elements of a distributed scheduler using all available locality information as well as experimental measurements are prese...

Affinity Scheduling and the Applications on Data Center Scheduling with Data Locality

MapReduce framework is the de facto standard in Hadoop. Considering the data locality in data centers, the load balancing problem of map tasks is a special case of affinity scheduling problem. There is a huge body of work on affinity scheduling, proposing heuristic algorithms which try to increase data locality in data centers like Delay Scheduling and Quincy. However, not enough attention has ...

A new multi-objective bi-level programming model for energy and locality aware multi-job scheduling in cloud computing

How to reduce power consumption of data centers has received worldwide attention. By combining the energy-aware data placement policy and locality-aware multi-job scheduling scheme, we propose a new multi-objective bi-level programming model based on MapReduce to improve the energy efficiency of servers. First, the variation of energy consumption with the performance of servers is taken into ac...

Towards Makespan Minimization Task Allocation in Data Centers

Nowadays, data centers suffer from resource limitations in both the limited bandwidth resources on the links and the computing capability on the servers, which triggers a variety of resource management problems. In this paper, we discuss one classic resource allocation problem: task allocation in data centers. That is, given a set of tasks with different makespans, how to schedule these tasks i...

Journal:
  • CoRR

Volume abs/1702.07802

Publication date: 2017